Parallel Computing

Author

Jonathan Chipman with content provided by George Vega Yon

Introduction

Today’s content has been drawn from:

  1. https://github.com/gvegayon/ocrug-hpc-august2019

  2. Roger Peng’s book: R Programming for data science, Ch 22

  3. Norman Matloff’s book: The Art of R Programming, Ch 16

High-Performance Computing: An overview

Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:

  1. Big data: How to work with data that doesn’t fit your computer

  2. Parallel computing: How to take advantage of multiple core systems

  3. Compiled code: Write your own low-level code (if R doesn’t have it yet…)

(Check out the CRAN Task View on HPC)

Some vocabulary for HPC

In raw terms

  • Supercomputer: A single big machine with thousands of cores/gpus.

  • High Performance Computing (HPC): Multiple machines within a single network.

  • High Throughput Computing (HTC): Multiple machines across multiple networks.

You may not have access to a supercomputer, but HPC/HTC clusters are certainly more accessible these days; e.g., AWS provides a service to create HPC clusters at low cost.

What’s “a core”?

Taxonomy of CPUs (downloaded from https://slurm.schedmd.com/mc_support.html)

Now, how many cores does your computer have? The parallel package can tell you:

parallel::detectCores()
[1] 10

What is parallel computing, anyway?

f <- function(n) n*2
f(1:4)

Here we are using a single core: the function is applied one element at a time, leaving the other 3 cores idle.

f <- function(n) n*2
parallel::mclapply(1:4, f, mc.cores = 4L)

In this more intelligent way of computing, we take full advantage of our computer by using all 4 cores at the same time (sketched here with parallel::mclapply, which relies on forking and hence is not available on Windows). This translates into reduced computation time which, in the case of complicated/long calculations, can be an important speed gain.

When is it a good idea?

Let’s think before we start…

“In almost any parallel-processing application, you encounter overhead, or “wasted” time spent on noncomputational activity.” Matloff, ch 16, pg 337. See 16.4 for sources of overhead.

Ask yourself these questions before jumping into HPC!
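For instance, when each task is tiny, the communication overhead can make the parallel version slower than the serial one. A minimal sketch (actual timings depend on your machine):

```r
library(parallel)

cl <- makePSOCKcluster(2L)

# sqrt is so cheap that shipping tasks to the workers and collecting
# results costs more than the computation itself
system.time(parSapply(cl, 1:1e4, sqrt))  # often slower than serial
system.time(sapply(1:1e4, sqrt))

stopCluster(cl)
```

The results are identical either way; only the time spent differs.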

Parallel computing in R

While there are several alternatives (just take a look at the High-Performance Computing Task View), we’ll focus on the following R-packages for explicit parallelism.

  • parallel: R package that provides ‘[s]upport for parallel computation, including random-number generation’.
  • foreach: R package for ‘general iteration over elements’ in parallel fashion.
  • future: ‘[A] lightweight and unified Future API for sequential and parallel processing of R expression via futures.’ (won’t cover here)
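The foreach flavor can be sketched as follows. This is a minimal sketch assuming the foreach and doParallel packages are installed; doParallel is the backend that registers a parallel cluster for foreach's %dopar% operator:

```r
library(foreach)
library(doParallel)

# Create a cluster and register it as the foreach backend
cl <- parallel::makePSOCKcluster(2L)
registerDoParallel(cl)

# %dopar% sends each iteration to a worker; .combine = c collects
# the results into a single vector
ans <- foreach(n = 1:4, .combine = c) %dopar% (n * 2)

parallel::stopCluster(cl)
ans
```

Replacing %dopar% with %do% runs the same loop sequentially, which makes foreach convenient for debugging.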

Implicit parallelism or hidden parallelism, on the other hand, refers to out-of-the-box tools that allow the programmer not to worry about parallelization, e.g., gpuR for matrix manipulation using the GPU, tensorflow, and data.table.

And there’s also a more advanced set of options

  • Rcpp + OpenMP: Rcpp is an R package for integrating R with C++, and OpenMP is a library for high-level parallelism for C/C++ and Fortran.
  • A ton of other types of resources, notably tools for working with batch schedulers such as Slurm, HTCondor, etc.

The parallel package

  • Based on the snow (simple network of workstations) and multicore R Packages.

  • Explicit parallelism.

  • Simple yet powerful idea: Parallel computing as multiple R sessions.

  • Clusters can be made of both local and remote sessions.

  • Multiple types of cluster: PSOCK, Fork, MPI, etc.

Differences between socket and fork

“For the fork, each parallel thread is a complete duplication of the master process with the shared environment, including objects or variables defined prior to the kickoff of parallel threads. Therefore, it runs fast. However, the major limitation is that the fork doesn’t work on the Windows system.”

“On the other hand, the socket works on all operating systems. Each thread runs separately without sharing objects or variables, which can only be passed from the master process explicitly. As a result, it runs slower due to the communication overhead.”
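The difference can be seen directly. A minimal sketch (the fork branch is skipped on Windows, where forking is unavailable):

```r
library(parallel)

x <- 20

if (.Platform$OS.type != "windows") {
  # Fork: each worker is a copy of the master process, so it
  # already sees x without any explicit export
  cl <- makeForkCluster(2L)
  print(clusterEvalQ(cl, x))
  stopCluster(cl)
}

# Socket (PSOCK): workers start as fresh R sessions, so x must be
# shipped to them explicitly before they can use it
cl <- makePSOCKcluster(2L)
clusterExport(cl, "x")
print(clusterEvalQ(cl, x))
stopCluster(cl)
```

Omitting the clusterExport call in the PSOCK case would raise an "object 'x' not found" error on the workers.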

Parallel workflow

(Usually) We do the following:

  1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster)

  2. Copy/prepare each R session (if you are using a PSOCK cluster):

    1. Copy objects with parallel::clusterExport

    2. Pass expressions with parallel::clusterEvalQ

    3. Set a seed

  3. Do your call: parallel::parApply, parallel::parLapply, etc.

  4. Stop the cluster with parallel::stopCluster

parallel::clusterApply documentation

parallel documentation

Ex 1: Hello world!

# 1. CREATING A CLUSTER
library(parallel)
cores <- parallel::detectCores()
cl <- parallel::makePSOCKcluster(cores)    
x  <- 20

# 2. PREPARING THE CLUSTER
parallel::clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`
parallel::clusterExport(cl, "x")

# 3. DO YOUR CALL
parallel::clusterEvalQ(cl, {
  paste0("Hello from process #", Sys.getpid(), ". I see x and it is equal to ", x)
})
[[1]]
[1] "Hello from process #97409. I see x and it is equal to 20"

[[2]]
[1] "Hello from process #97412. I see x and it is equal to 20"

[[3]]
[1] "Hello from process #97408. I see x and it is equal to 20"

[[4]]
[1] "Hello from process #97407. I see x and it is equal to 20"

[[5]]
[1] "Hello from process #97405. I see x and it is equal to 20"

[[6]]
[1] "Hello from process #97404. I see x and it is equal to 20"

[[7]]
[1] "Hello from process #97411. I see x and it is equal to 20"

[[8]]
[1] "Hello from process #97410. I see x and it is equal to 20"

[[9]]
[1] "Hello from process #97406. I see x and it is equal to 20"

[[10]]
[1] "Hello from process #97413. I see x and it is equal to 20"
# 4. STOP THE CLUSTER
parallel::stopCluster(cl)

Ex 2: Parallel regressions

Problem: Run multiple regressions for each column of a very wide dataset. Report the \(\beta\) coefficients from fitting the following regression models:

\[ y = X_{i}\beta_i + \varepsilon_{i},\quad \varepsilon_{i} \sim N(0, \sigma^2_i) \]

\[ \quad\forall \text{ columns } i \in \left\{1, \dots, 999 \right\} \]

set.seed(131)
y <- rnorm(500)
X <- matrix(rnorm(500*999), nrow = 500, dimnames = list(1:500, sprintf("x%03d", 1:999)))
dim(X)
[1] 500 999
X[1:6, 1:5]
         x001        x002       x003       x004      x005
1  0.61827227  1.72847041 -1.4810695 -0.2471871 1.4776281
2  0.96777456 -0.19358426 -0.8176465  0.6356714 0.7292221
3 -0.04303734 -0.06692844  0.9048826 -1.9277964 2.2947675
4  0.84237608 -1.13685605 -1.8559158  0.4687967 0.9881953
5 -1.91921443  1.83865873  0.5937039 -0.1410556 0.6507415
6  0.59146153  0.81743419  0.3348553 -1.8771819 0.8181764
str(y)
 num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...

Serial solution:


Use apply (a for loop under the hood) to solve it:

ans <- apply(
  X      = X,
  MARGIN = 2,
  FUN    = function(x, y) coef(lm(y ~ x)),
  y      = y
  )

ans[,1:5]
                   x001        x002        x003        x004        x005
(Intercept) -0.03449819 -0.03339681 -0.03728140 -0.03644192 -0.03717344
x           -0.06082548  0.02748265 -0.01327855 -0.08012361 -0.04067826

Ex 2: Parallel regressions (cont’d 2)


Parallel solution: Use parApply

library(parallel)
cl <- makePSOCKcluster(4L)
ans <- parApply(
  cl     = cl,
  X      = X,
  MARGIN = 2,
  FUN    = function(x, y) coef(lm(y ~ x)),
  y      = y
  )

ans[,1:5]
                   x001        x002        x003        x004         x005
(Intercept) -0.02094065 -0.01630664 -0.01565541 -0.01848773 -0.015305816
x            0.09269758 -0.05233096  0.02893588  0.02404687  0.009151671

Are we going any faster? Compare the timing using the bench package.


library(bench)
mark(
  parallel = parApply(
    cl  = cl,
    X   = X, MARGIN = 2,
    FUN = function(x, y) coef(lm(y ~ x)),
    y   = y
    ),
  serial = apply(
    X   = X, MARGIN = 2,
    FUN = function(x, y) coef(lm(y ~ x)),
    y   = y
    )
)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 parallel     74.9ms   78.4ms     12.5     11.8MB     5.36
2 serial      228.4ms  229.9ms      4.31    98.8MB    41.7 

How can you create a general-purpose parallel computing package that adapts to the operating system?

There may be smarter ways, though this is how I approached it: the SeqSGPV package
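One minimal pattern is to branch on the operating system when building the cluster, using forking where it is available and sockets otherwise. Here make_cluster is a hypothetical helper for illustration, not the implementation used in SeqSGPV:

```r
# Pick a cluster type based on the OS: forking is unavailable on
# Windows, so fall back to a socket (PSOCK) cluster there
make_cluster <- function(ncores) {
  if (.Platform$OS.type == "windows") {
    parallel::makePSOCKcluster(ncores)
  } else {
    parallel::makeForkCluster(ncores)  # shares the master's environment
  }
}

cl <- make_cluster(2L)
parallel::stopCluster(cl)
```

Downstream code can then use the returned cluster with parApply, parSapply, etc., without caring which type it got.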

See also

For more, check out the CRAN Task View on HPC

Bonus track: Simulating \(\pi\)

  • We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We approximate it by randomly sampling points \(x\) in a square with sides of length 2 centered at the origin.

  • So, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\)

The R code to do this

pisim <- function(i, nsim) {  # Notice we don't use the -i-
  # Random points
  ans  <- matrix(runif(nsim*2), ncol=2)
  
  # Distance to the origin
  ans  <- sqrt(rowSums(ans^2))
  
  # Estimated pi
  (sum(ans <= 1)*4)/nsim
}

library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)

# Number of simulations we want each time to run
nsim <- 1e5

# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))

# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
rbenchmark::benchmark(
  parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
  serial   = sapply(1:100, pisim, nsim=nsim), replications = 1
)[,1:4]
      test replications elapsed relative
1 parallel            1   0.085    1.000
2   serial            1   0.277    3.259

Not yet discussed: RcppArmadillo and OpenMP

  • Friendlier than RcppParallel… at least for ‘I-use-Rcpp-but-don’t-actually-know-much-about-C++’ users (like myself!).

  • Only ‘thread-safe’ calls can run inside parallel blocks, so calling R from within them will cause problems (almost all the time).

  • Use arma objects, e.g. arma::mat, arma::vec, etc. Or, if you are used to them, std::vector objects, as these are thread-safe.

  • Pseudo-random number generation is not very straightforward… But C++11 has a nice set of functions that can be used together with OpenMP

  • Need to think about how processors work, cache memory, etc. Otherwise you could get into trouble: if your code is slower when run in parallel, you are probably facing false sharing

  • If R crashes… try running R with a debugger (see Section 4.3 in Writing R extensions):

    ~$ R --debugger=valgrind

Session info

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.2.1 (2022-06-23)
 os       macOS Monterey 12.6
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Denver
 date     2023-02-14
 pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.6   2021-08-19 [1] CRAN (R 4.2.0)
 callr         3.7.2   2022-08-22 [1] CRAN (R 4.2.0)
 cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.2.0)
 digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
 evaluate      0.16    2022-08-09 [1] CRAN (R 4.2.0)
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
 fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
 htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
 htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.2.0)
 httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.2.0)
 jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.2.0)
 knitr         1.40    2022-08-24 [1] CRAN (R 4.2.0)
 later         1.3.0   2021-08-18 [1] CRAN (R 4.2.0)
 lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.2.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.2.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.2.0)
 pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.2.0)
 pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.2.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.2.0)
 processx      3.7.0   2022-07-07 [1] CRAN (R 4.2.0)
 profvis       0.3.7   2020-11-02 [1] CRAN (R 4.2.0)
 promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.2.0)
 ps            1.7.1   2022-06-18 [1] CRAN (R 4.2.0)
 purrr         0.3.5   2022-10-06 [1] CRAN (R 4.2.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
 Rcpp          1.0.10  2023-01-22 [1] CRAN (R 4.2.0)
 remotes       2.4.2   2021-11-30 [1] CRAN (R 4.2.0)
 rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
 rmarkdown     2.17    2022-10-07 [1] CRAN (R 4.2.0)
 rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
 shiny         1.7.2   2022-07-19 [1] CRAN (R 4.2.0)
 stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
 stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.0)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.2.0)
 usethis       2.1.6   2022-05-25 [1] CRAN (R 4.2.0)
 xfun          0.33    2022-09-12 [1] CRAN (R 4.2.0)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.2.0)
 yaml          2.3.5   2022-02-21 [1] CRAN (R 4.2.0)

 [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────